On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases

Authors

  • Goetz Graefe
  • Usama M. Fayyad
  • Surajit Chaudhuri
Abstract

For a wide variety of classification algorithms, scalability to large databases can be achieved by observing that most algorithms are driven by a set of sufficient statistics that are significantly smaller than the data. By relying on a SQL backend to compute the sufficient statistics, we leverage the query processing system of SQL databases and avoid the need to move data to the client. We present a new SQL operator (Unpivot) that enables efficient gathering of statistics with minimal changes to the SQL backend. Our approach results in a significant increase in performance without requiring any changes to the physical layout of the data. We show analytically how this approach outperforms an alternative that requires changing the data layout. We also compare the effect of data representation and show that a “dense” representation may be preferred to a “sparse” one, even when the data are fairly sparse.
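
As a rough illustration of the idea in the abstract, the sketch below gathers class-conditional counts inside the database, so that only the small counts table reaches the client. It is not the authors' implementation. It assumes that the sufficient statistics are co-occurrence counts of (attribute, value, class), as is common for decision trees and naive Bayes, and because SQLite has no Unpivot operator it emulates one with UNION ALL, which may cost one scan per attribute. The function `gather_counts`, the table `cases`, and its columns are hypothetical names chosen for the example.

```python
import sqlite3


def gather_counts(conn, table, attributes, class_column):
    """Return {(attribute, value, class): count}, computed by one SQL query.

    Identifiers are interpolated directly into the SQL text, so this sketch
    assumes the table and column names come from a trusted source.
    """
    # Emulated unpivot (an assumption, not the paper's operator): one SELECT
    # per attribute column, stacked with UNION ALL, so each wide row becomes
    # one (attr_name, attr_value, class_value) triple per attribute.
    branches = [
        f"SELECT '{a}' AS attr_name, {a} AS attr_value, "
        f"{class_column} AS class_value FROM {table}"
        for a in attributes
    ]
    unpivoted = " UNION ALL ".join(branches)

    # Aggregate on the server; the client only receives the counts.
    query = (
        "SELECT attr_name, attr_value, class_value, COUNT(*) "
        f"FROM ({unpivoted}) "
        "GROUP BY attr_name, attr_value, class_value"
    )
    return {(n, v, c): k for n, v, c, k in conn.execute(query)}


# Tiny in-memory example with a hypothetical 'cases' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        ("sunny", "no", "yes"),
        ("sunny", "yes", "no"),
        ("rain", "yes", "no"),
        ("sunny", "no", "yes"),
    ],
)
counts = gather_counts(conn, "cases", ["outlook", "windy"], "play")
```

A classifier running on the client could then work from `counts` alone, for example to score candidate splits, without fetching the underlying rows.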

Similar resources

Green envelopes classification: the comparative analysis of efficient factors on the thermal and energy performance of green envelopes

This paper classifies green envelopes as green roofs and green walls according to effective factors, which were derived from literature to compare the green envelopes’ thermal and energy performance in a more effective way. For this purpose, an extensive literature review was carried out by searching keywords in databases and studying related journal papers and articles. The research meth...

An algorithm for the anchor points of the PPS of the CCR model

Anchor DMUs are a new class in the general classification of Decision Making Units (DMUs) in Data Envelopment Analysis (DEA). An anchor DMU in DEA is an extreme-efficient DMU that defines the transition from the efficient frontier to the free-disposability part of the boundary of the Production Possibility Set (PPS). In this paper, the anchor points of the PPS of the CCR model are investigated....

Dimensionality reduction of hyperspectral data to increase class separability and preserve data structure

Hyperspectral imaging, which gathers hundreds of spectral bands from the surface of the Earth, allows us to separate materials with similar spectra. Hyperspectral images can be used in many applications such as land chemical and physical parameter estimation, classification, target detection, unmixing, and so on. Among these applications, classification is of particular interest. A hyperspectral im...

Querying Hierarchical Data in Very Large Databases

Hierarchical data, such as a Partially Ordered Set (POSET), is widely used in relational databases, especially in data mining and data-warehouse-based applications. Unfortunately, SQL (Structured Query Language) does not effectively support hierarchical data structures for managing this sort of data. For example, in Oracle, a CONNECT BY operator is used to query data organized into trees, howeve...

3D Detection of Power-Transmission Lines in Point Clouds Using Random Forest Method

Inspection of power transmission lines using classic expert-based methods suffers from disadvantages such as a higher level of time and money consumption. The advent of UAVs and their application in aerial data gathering helps to decrease the time and cost significantly. The purpose of this research is to present an efficient automated method for inspection of power transmission lines based on point c...

Publication date: 1998